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Abstract.  We  present  a  parallel  implementation  of  Gaussian  elimination 
without  pivoting  using  the  nested  dissection  ordering  for  solving  Ax  =  b  where  A 
is  an  N  x  N  symmetric  positive  definite  matrix.  If  the  graph  of  A  is  a 
finite  element  mesh  then  Birkhoff  and  George  and  Liu  have  shown  that  a  parallel 
complexity  of  0(^jy)  can  be  achieved  for  Gaussian  elimination  with  the  nested 
dissection  ordering.  Our  implementation  achieves  this  parallel  complexity  on  a 
two  dimensional  MIMD  processor  array  with  N  processors  and  nearest  neighbors 
interconnections.  Thus  nested  dissection  is  a  near  optimal  algorithm  for  this 
problem  on  this  interconnection  topology.  The  parallel  implementation  on  this 
architecture  requires  15&\/&)+  OQoga(y^))  parallel  floating  point  multiplica¬ 
tions.  It  is  faster  than  a  Kung-Leiserson  systolic  array  for  banded  matrices  for 
N  >  961,  and  faster  than  a  serial  implementation  for  N  as  small  as  9. 
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1.  Introduction.  With  the  advent  of  research  into  parallel  computer  ar¬ 
chitectures  numerical  algorithms  must  be  reexamined  in  terms  of  their  parallel 
complexity.  A  realistic  parallel  complexity  analysis  takes  the  communication  cost 
into  consideration,  so  a  machine  architecture  must  be  assumed.  There  are  two 
approaches  form  this  point.  Given  a  computer  architecture  the  parallel  complex¬ 
ity  for  competing  algorithms  on  that  architecture  can  be  compared.  With  the 
introduction  of  VLSI  technology  and  new  design  tools  special  purpose  computers 
can  also  be  economically  feasible.  Then  the  comparison  is  between  the  computa¬ 
tional  complexity  of  (algorithm,  computer  architecture)  pairs  for  solving  a  given 
problem.  This  latter  analysis  need  not  be  restricted  to  special  purpose  comput¬ 
ers.  Consideration  of  the  mix  of  problems  to  be  solved  on  a  more  general  purpose 
parallel  architecture,  and  the  computational  complexity  of  these  (algorithm,  ar¬ 
chitecture)  pairs,  may  help  indicate  a  cost  effective  parallel  architecture. 

Consider  the  problem  of  solving  the  linear  system  Ax  =  b,  where  A  is  a 
symmetric  positive  definite  N  x  N  matrix.  We  define  the  graph  of  A  to  be 
an  undirected  graph  G(A)  =  (V,E(A))  where  V  —  {1,2, . . . ,  AT)  and  E(A)  = 
{(t,  j')|o,7  /  0}.  Let  the  graph  of  A  be  isomorphic  to  a  y/N  x  y/N  grid  where 
each  vertex  has  edges  to  its  eight  nearest  neighbors,  a  finite  element  mesh,  or  to  a 
subset  of  such  a  graph.  See  Figure  1.  This  is  the  model  equation  used  by  George 
[3]  to  describe  the  nested  dissection  algorithm.  Such  problems  arise  when  solving 
a  second  order  elliptic  partial  differential  equation  in  two  space  dimensions.  The 
graph  of  the  matrix  reflects  a  uniform  finite  difference  or  finite  element  grid  in 
the  spatial  domain,  or  a  simple  mapping  of  such,  with  a  differential  operator 
approximation  that  only  involves  nearest  neighbors.  Finally,  let  n  —  y/N  and 
restrict  it  to  the  form  n  =  21  —  1  for  some  l  €  {1,2, . . .}.  This  model  problem 
is  used  to  describe  the  algorithm,  but  our  approach  can  easily  be  extended  to 
handle  more  general  problems. 

Our  goal  is  to  introduce  a  computer  architecture  and  a  mapping  onto  this 
architecture  which  achieves  the  parallelism  predicted.by  Birkhoff  and  George  [l], 
and  by  Liu  [10].  Their  results  give  a  parallel  complexity  of  O(n)  using  0(n2) 
processors  for  the  matrix  factorisation  A  =  LDL*  and  the  upper  and  lower  tri¬ 
angular  matrix  solves.  By  taking  into  account  the  communication  time  and  the 
time  required  for  program  and  data  loading  and  unloading,  a  complete  parallel 
complexity  analysis  can  be  made.  We  argue  for  the  usefulness  of  this  approach  by 
emphasising  the  generality  and  simplicity  of  the  computer  architecture,  and  the 
optimality  of  the  nested  dissection  algorithm  for  this  architecture.  An  algorithm 
similar  to  ours  has  been  presented  by  Gannon  [2].  He  uses  a  blend  of  Givens 
rotations  and  block  Gaussian  elimination  to  triangularise  the  matrix  (not  factor 
it),  and  has  somewhat  different  architectural  requirements.  His  algorithm  also 
involves  larger  constants  and  a  more  complex  program.  But  some  of  the  ideas  are 
the  same,  and  this  work  provides  a  detailed  look  at  an  implementation  of  these 
common  ideas. 
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In  sections  2  and  3  we  review  the  symmetric  factbrisation  and  the  nested 
dissection  algorithm.  In  section  4  we  introduce  our  architecture  and  how  to  use  it 
efficiently  to  solve  subproblems  arising  in  the  nested  dissection  algorithm.  Section 
5  describes  the  parallel  factorisation  algorithm.  Section  6  describes  the  parallel 
upper  and  lower  triangular  matrix  solves.  Section  7  discusses  practical  aspects 
of  using  the  algorithm,  and  section  8  analyses  the  efficiency  of  the  algorithm. 

2.  The  Symmetric  Factorisation.  We  solve  Ax  =  6  by  calculating  a 
symmetric  factorisation 

A  =  LDLX 

with  L  unit  lower  triangular  and  D  diagonal,  and  forward  and  backward  trian¬ 
gular  matrix  solves 


Lm  ss  6, 


L‘x  =  £>-**. 


Let  Ir  be  the  r  x  r  identity  matrix.  Let  /  represent  identity  matrices  whose 
sises  are  clear  by  context.  Define  A<°)  =  Aq  =  A  and  =  Iff.  Assume  we 
have  a  factorisation 

for  s'  >  1  where  is  unit  lower  triangular  and  A^*_1)  is  block  diagonal: 


i  is  an  (s’  —  1)  x  (s'  —  1)  unit  lower  triangular  matrix,  and  is  an 
(s'  —  1)  x  (s'  —  1)  diagonal  matrix.  C,_i  is  an  (N  —  s'  +  1)  x  (s'  —  1)  matrix. 
A*_i  is  an  (N  —  s'  +  1)  x  (N  —  s'  +  1)  matrix.  Label  the  elements  of  A,_j, 


where  d,-  is  a  scalar  and  u,  is  a  vector  of  length  N  —  s'; 
A*'-1*  can  be  factored, 


A('-i) 


where 


(2.1) 


and  0  is  the  sero  vector  of  length  N  —  s*. 
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Thus  the  factorisation  of  A  becomes 


Li  is  an  i  x  *  unit  lower  triangular  matrix,  and  C,  is  an  (N  —  *)  x  i  matrix. 

Since  A  is  symmetric  positive  definite,  is  nonsero,  and  this  process  de¬ 
scribes  an  algorithm  to  produce  the  symmetric  factorisation. 

3.  Nested  Dissection.  When  only  a  small  fraction  of  the  elements  of  A 
are  nonsero,  A  is  said  to  be  sparse.  When  a  sparse  matrix  is  factored,  fill-in  can 
occur:  the  triangular  factors  contain  nonzeroes  in  positions  where  A  has  seroes. 
An  increase  in  the  number  of  nonseroes  increases  the  work  in  computing  the 
factorisation  and  the  triangular  matrix  solves.  The  amount  of  fill-in  is  sensitive 
to  the  ordering  of  the  columns  of  the  matrix,  also  called  the  variable  ordering. 

Nested  dissection  is  a  symmetric  factorisation  A  =  LDL *  with  a  particular 
heuristic  for  the  variable  ordering  intended  to  minimise  fill-in  [4].'  At  least 
n3/6  +  0(n2)  multiplications  are  required  to  factor  A  for  our  model  problem 
[3]  [6].  The  nested  dissection  variable  ordering  leads  to  a  multiplicative  operation 
count  of  267/28  n3  +  0(n2  log2  n),  and  so  is  close  to  optimal. 

The  symmetric  factorisation  can  be  described  in  terms  of  the  graph  of  the 
matrix  A.  Each  vertex  of  the  graph  represents  a  variable  or  column  of  the  matrix, 
and  each  edge  between  the  vertices  v,  and  «y  represents  a  nonsero  a,y  or  ay,-.  Thus 
the  graph  of  the  matrix  describes  the  sparsity  pattern.  The  graph  of  the  matrix 
Ai,  produced  by  the  »th  step  of  the  symmetric  factorisation  (sec  equation  (2.1)),  is 
a  simple  modification  of  the  matrix  graph  for  remove  the  vertex  associated 

with  the  tll>  variable  and  add  edges  between  all  vertices  that  had  been  connected 
to  it,  if  they  don’t  already  share  an  edge.  The  addition  of  edges  reflects  additional 
nonseroes  in  the  triangular  factor  L.  We  denote  both  the  graph  manipulation 
and  the  step  in  the  matrix  factorisation  by  the  phrase  eliminating  the  vertex. 

The  nested  dissection  heuristic  divides  the  graph  of  the  model  problem  into 
five  subsets:  a  cross  dividing  the  grid  into  four  equal  parts,  and  the  four 
(n  —  i)/2  x  (n  —  l)/2  subgrids.  Sec- Figure  2.  By  numbering  (eliminating)  the 
cross  vertices  last  the  vortices  in  the  subgrids  can  be  eliminated  without  edges 
forming  between  vertices  in  different  subgrids.  The  vertices  of  the  subgrids  are 
recursively  ordered  by  the  same  rule. 

If  the  inital  cross  of  vertices  is  considered  to  be  at  level  l  there  will  be 
four  crosses  at  level  l  —  1  and  4,-k  crosses  at  level  k.  For  the  model  problem 
l  =  log 2  (n  +  1).  All  vertices  at  a  given  level  are  eliminated  before  those  at  the 
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next  higher  level.  With  this  variable  ordering  the  matrix  A  has  the  form : 

Qi  CJ\ 

Qi  C\ 

Qi  Cl  . 

Qi  Cl 

Ci  Cj  C$  C4  St ) 

Qi  is  an  ( N  — 1)/4  x  (JV— 1)/4  matrix  corresponding  to  the  vertices  in  quadrant  t. 
Si  is  a  (2n  —  1)  x  (2n—  1)  matrix  corresponding  to  the  cross  of  vertices  separating 
the  four  subgrids.  C,  is  a  (2 n  —  1)  x  (N  -  l)/4  matrix  corresponding  to  the  edges 
between  subgrid  i  and  the  cross.  Each  Qi  in  turn  is  a  matrix  with  the  same 
structure  as  A. 

The  LDLt  factorisation  of  A  for  the  model  problem  using  the  nested  dissec¬ 
tion  ordering  is: 

(Qi  Of 

Qi  Cl 

Qi  C‘, 

Qi  Cl 

\  Ci  Cy  C3  C4  Si 

Li 

La 

L3 

,  .  .  k* 

Ci  Cj  C3  C 4  L, 

where  Li  is  a  lower  triangular  matrix  with  the  same  genera]  structure  as  L. 

The  fcth  level  vertices  reside  in  4*~*  crosses,  each  enclosed  in  a  box  composed 
of  vertices  at  higher  levels.  Each  cross  has  2^k+l^  -  3  vertices,  and  each  spoke 
of  the  cross  has  2^fc“1l  —  1  vertices.  To  order  the  vertices  in  a  cross  number  the 
vertices  in  the  right  horisontal  spoke  from  right  to  left,  continue  numbering  in 
the  left  spoke  from  right  to  left,  and  finish  numbering  in  the  vertical  separator 
from  the  top  to  the  bottom.  See  Figure  3.  To  eliminate  a  cross: 

a)  Eliminate  the  left  and  right  horizontal  spokes  of  the  cross.  These  are 
independent  groups  of  vertices. 

b)  Eliminate  the  remaining  vertical  separator. 

The  elimination  of  the  crosses  at  the  klU  level  are  all  independent.  We  will  refer 
to  these  as  ktb  level  subproblems. 

The  elimination  of  the  vertices  at  lower  levels  couples  each  of  the  horizontal 
spoke  vertices  to  all  of  the  vertices  in  the  box  surrounding  it,  which  is  made 
up  of  the  vertical  separator  and  half  of  the  vertices  in  the  box  surrounding  the 
entire  cross,  and  to  no  vertices  outside  of  the  .box.  After  the  elimination  of  the 
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horizontal  spokes  the  vertical  separator  vertices  are  coupled  to  all  of  the  vertices 
in  the  box  surrounding  it.  Let  p  be  the  number  of  vertices  in  a  spoke.  Let  q  be 
the  number  of  vertices  in  the  surrounding  box.  Then  the  elimination  of  a  spoke 
of  vertices  from  a  kth  level  cross  corresponds  to  a  partial  factorization  of  a  dense 
matrix  F: 


»■>'■(?  ?)-uv :)(?  s)( 


L*  D~1L~1C* 

0  I 


) 


5  is  a  p  x  p  symmetric  positive  definite  matrix,  and  B  =  B  —  CS-1C‘  is  the  q  x  q 
matrix  B  modified  by  the  elimination  of  the  spoke.  The  matrix  F  is  a  selection 
of  rows  and  columns  from  the  matrix  Ar  (  for  r  =  4J[l  —  2-fc+2(l  -  2“*)])  from 
which  zero  elements  have  been  removed. 

Using  the  same  notation  as  in  section  2  the  factorization  proceeds  by  calcu¬ 
lating 

(S  ,)'«("  7) 

“  (fc  “H 

where 

F(,-D  =  ,  f>(i)  —  fD'~l  *  6t) 

\  Vi  FiJ  V  6  FJ 

and  Fi  —  Fi  —  Let  fj£  be  the  (As,  j)  element  of  F^  and  be  the  (k,j) 

element  of  F.  So  for  each  column  j  in  F^'~1^ ,  j  >  i ,  we  calculate 

/-  \  /(*~M 

/(«)  _  _  V.V«)fc— i  (-  \  .  .  _  Ai-l)  _  £ji _ A*-t)  t  jl  v>  • 

Jki  ~  Jki  ^  Vt,*Ax— «  —  J kj  (i-x)Jki  tor  K  >  J 

'  it 

since  f'f* -l)  is  symmetric. 

The  partial  factorization  stops  at 


(c,  ?)• 


The  vectors  {v*}  are  identical  to  the  vectors  {u,  }  described  in  section  2,  for  some 
j,  with  the  zero  elements  removed.  The  elimination  of  the  vertical  separator  is 
described  similarly. 
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4.  The  Architecture  and  Its  Uses.  The  machine  is  an  n  x  n  grid  of 
processors.  Each  processor  can  communicate  with  its  four  nearest  neighbors. 
Processors  on  one  side  of  the  grid  have  access  to  an  external  communication  net¬ 
work.  Sec  Figure  4.  This  is  an  MIMD  architecture,  and  each  processor  has  both 
instruction  and  data  memory,  and  can  do  floating  point  addition,  subtraction, 
multiplication,  and  division.  The  data  memory  can  hold  up  to  3.5n  +  O(l)  float¬ 
ing  point  numbers,  while  the  program  memory  can  hold  a  program  whose  size  is 
independent  of  n  . 

The  memory  requirements  are  necessary  for  our  algorithm.  The  program 
consists  of  steps  of  m  (m  <  4)  simultaneous  writes,  followed  by  up  to  4  —  m 
simultaneous  reads,  followed  by  some  arithmetic.  The  writes  and  reads  are  not  on 
the  same  lines,  and  the  written  values  can  remain  on  the  data  path  throughout  the 
step  to  insure  a  correct  read  by  the  destination  processor.  This  architecture  scales 
since  the  length  of  the  interprocessor  communication  lines  remains  fixed  as  n 
increases.  There  is  neither  a  global  instruction  path,  nor  a  global  clock,  therefore 
all  wires  external  to  the  processors  are  short.  But  the  memory  requirements  do 
grow  linearly  in  n  for  our  algorithm. 

The  nested  dissection  algorithm  described  in  section  3  calculates  the  sym¬ 
metric  factorization  of  A  by  partially  factoring  a  sequence  of  small  dense  matrices. 
Systolic  algorithms  perform  very  well  on  dense  problems  of  these  sorts  [9],  and 
these  algorithms  can  be  run  within  the  topology  of  our  processor  array.  Again  let 
p  be  the  number  of  vertices  in  a  spoke  or  separator  to  be  eliminated.  Let  q  be  the 
number  of  vertices  in  the  surrounding  box.  We  treat  the  partial  factorization  of 
equation  (3.1)  as  two  operations.  First  is  a  reduction  phase  which  calculates  the 
first  p  columns  of  the  lower  triangular  factor  ({5,/d,}),  and  the  first  p  columns  of 
F ({d,}).  The  second  is  an  update  which  produces  the  last  q  columns  of  F^v\ 

A  systolic  architecture  sufficient  to  perform  the  reduction  phase  is  a  p(p+l)/2 
element  triangular  processor  array.  See  Figure  5.  We  will  refer  to  processors  by 
their  (column, row)  coordinates.  Assume  that  processor  (*,  1)  has  access  to  the 
matrix  elements  (on  and  below  the  diagonal)  of  column  ».  By  sending  the  elements 
of  column  1  from  processor  (1, 1)  to  the  left  through  the  processors  in  row  1,  we 
can  calculate 

/(i)  _  AO)  (vi)fc-i(Si)y-i 

"  d, 

in  processor  (j,  1)  for  all  k  G  {j,p  +  q}-  If  the  elements  are  sent  tc  processor 
(/,  2)  as  soon  as  they  arc  available  we  can  begin  to  calculate 

A2)  _  Ai)  (g2)fc-a({?2)j-a 

hi  ~  hi  J2 

immediately.  This  works  since  processor  (2, 1)  is  producing  its  results  first,  re¬ 
ceiving  the  data  from  processor  (1, 1)  directly,  and  its  results  are  the  d3  and  v3 
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the  three  other  directions  at  the  beginning  of  the  next  step.  Each  processor  uses 
the  data  it  receives  if  appropriate.  Processors  which  already  have  their  solutions 
ignore  all  incoming  data,  so  the  wave  front  does  not  escape  the  square.  The 
horizontal  spokes  do  not  let  information  pass  across  them  vertically,  so  the  xv 
wave  front  stops  at  the  horizontal  spokes  except  for  a  pulse  that  continues  up  the 
vertical  separator.  As  soon  as  zp  passes  though  processor  p  —  1  it  has  its  solution 
zp_i.  The  step  after  xv  has  been  sent  on  its  way  processor  p  —  1  sends  out 
xp-i,  starting  its  own  wave  front.  The  two  wave  fronts  do  not  conflict,  one  being 
properly  nested  inside  the  other.  Two  steps  later  processor  p  —  2  will  start  its  own 
wave  front  with  the  xp_2  datum.  See  Figure  13  for  a  pictorial  description  of  this 
process.  This  continues  until  the  center  of  the  cross  is  ready  to  broadcast  its  value. 
The  horizontal  spokes  will  send  the  information  both  upwards  and  downwards, 
as  per  the  convention,  and  the  broadcast  will  cover  the  entire  square.  But  the 
next  wave  front  is  again  trapped  by  the  horizontal  spokes,  this  time  in  the  upper 
half  of  the  square,  as  are  all  of  the  rest  of  the  wave  fronts  in  step  (a). 

Let  one  backward  solve  step  consist  of  either  1  or  3  writes,  followed  by  1 
read,  followed  by  1  multiplication  and  1  subtraction. 

Lemma  6.1.  Part  (a)  for  a  level  k  backward  solve  subproblem  takes  3  •  2fc  —  6 
backward  solve  steps. 

Proof:  The  vertical  separator  contains  2fc  —  1  processors.  It  takes  2fc  —  2  steps 
for  the  first  information  to  arrive  at  the  top  of  the  separator.  It  takes  another 
2k  —  3  steps  for  the  rest  of  the  vertical  separator  information  to  arrive.  This  top 
processor  then  waits  one  step  before  broadcasting  its  value,  which  takes  2*  —  2 
steps  to  cover  the  top  half  of  the  square.  I 

Part  (b)  is  practically  the  same  algorithm  except  that  the  vertical-horizontal 
conventions  of  before  are  reversed.  If  data  from  a  wave  front  enters  a  processor 
from  the  right  or  left  it  sends  it  on  to  the  left  or  right  respectively  at  the  next 
step,  but  if  it  enters  from  the  top  or  bottom  the  processor  sends  it  on  in  the 
other  three  directions.  The  horizontal  spokes  start  with  the  leftmost  processors. 
The  right  and  left  spokes  are  backward  solved  concurrently  since  nothing  passes 
across  the  now  inactive  vertical  separator  processors. 

Lemma  6.2.  Part  (b)  of  a  level  k  backward  solve  subproblem  takes  2k+i  —  6 
backward  solve,  steps. 

Proof:  The  horizontal  spokes  each  contain  2fc_1  —  1  processors.  It  takes 
2fc~*  —2  steps  for  the  rightmost  processor  of  each  spoke  to  sec  its  first  information. 
2*~f  -  3  steps  later  it  has  seen  the  rest  of  the  wave  fronts  arrive.  After  waiting 
one  step  this  last  processor  broadcasts  its  value,  which  takes  2k  —  2  steps  to  cover 
its  half  of  the  square.  I  . 

The  subproblcms  on  a  given  level  can  all  be  solved  in  parallel.  Level  l  is  not 
a  special  case  for  the  backsolve,  and  level  1  is  trivial  as  there  is  nothing  to  do. 
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Theorem  8.1.  The  forward  solve  algorithm  takes  22 n  +  (6r  —  26)  log2  (n  + 1)  — 
4 r  —  24  forward  solve  steps  and  17.5 n  4-  (9r  —  24)  log2  (n  + 1)  —  15r  —  18  commu¬ 
nication  steps. 

Proof:  This  result  follows  from  an  analysis  analogous  to  the  factorization 
operation  count,  using  similar  level  1  and  level  l  improvements.  The  deliberate 
transpose  of  data  in  step  b)vii)  of  the  factorization  is  not  needed,  nor  is  any  other 
such  data  movement.  I 

If  the  forward  solve  is  calculated  during  the  factorization  it  does  not  need  the 
row  entries  of  L.  This  reduces  the  data  storage  requirements  from  3.5n  floating 
point  numbers  to  2n.  The  separate  forward  solve  algorithm  can  be  modified  to 
take  advantage  of  the  same  storage  savings,  but  it  significantly  increases  the  data 
movement  required  and  involves  a  more  complex  program.  Notice  that  r  —  1 
additional  right  hand  sides  only  increase  the  work  by  0(r  log2  (n  +  1))  steps,  yet 
there  is  an  overhead  of  0(n)  to  get  started. 

For  now  consider  solving  for  a  single  right  hand  side.  The  vector  z, 
i  =  L~lb,  has  been  calculated  and  z}  stored  in  processor  j.  If  not  calculated 
during  the  factorization  we  next  calculate  the  diagonal  solve,  y y  =  Zj/dj,  in  all 
the  processors  simultaneously.  Finally  we  calculate  the  backward  solve, 

"  i 

xi  =  Vi  -  X)  x  (*> )»->*«• 

•«i+ 1  V 

Once  an  x,-  is  calculated  it  is  broadcast  to  all  of  the  processors  containing 
a  yy  ,  j  <  t,  such  that  (tiy/dy),_y  ^  0.  Again  we  have  subproblcms  arising  from 
the  variable  ordering.  Within  a  square  grid' bordered  by  processors  which  already 
contain  their  associated  solution,  Xi,  we  calculate  the  solutions  associated  with 
a  cross  of  processors.  The  results  are  sent  to  processors  in  the  four  subgrids  of 
the  square.  When  they  have  all  been  received  crosses  in  these  subgrids  can  be 
backward  solved  in  parallel  and  the  results  in  turn  broadcast  to  their  subgrids. 
We  overlap  the  backward  solve  of  the  cross  and  the  broadcast  of  the  results,  and 
divide  this  process  into  two  parts: 

a)  Backward  solve  and  broadcast  the  vertical  separator. 

b)  Backward  solve  and  broadcast  the  left  and  right  horizontal  spokes  si¬ 
multaneously. 

At  the  start  of  part  (a)  the  processor  at  the  bottom  of  the  vertical  separator 
(call  this  vertex  p)  already  has  its  solution  xpi  and  sends  this  result  to  its  three 
neighboring  processors  inside  the  square.  At  the  next  step  these  processors  pass 
xv  to  their  neighbors  who  haven’t  already  received  it.  The  information  continues 
to  move  out  in  a  wave  front  through  the  grid.  If  the  information  is  received 
from  above  or  below,  the  receiving  processor  continues  sending  it  in  the  original 
direction.  If  it  is  received  from  the  left  or  the  Tight  the  processor  sends  it  out  in 
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We  again  solve  dense  subproblems,  the  forward  solve  of  a.  ktb  level  cross  of  ver¬ 
tices,  using  systolic  algorithms.  In  the  factorization  a  spoke  of  vertices  is  reduced 
using  a  triangular  array  of  processors.  Number  these  vertices  as  u  to 
u+p—  1.  We  forward  solve  for  {{2„)j\j  6  {v, . . .  ,u+p—  1}}  for  all  s  6  {1,. ..  ,r} 
using  the  same  processor  array.  Each  spoke  processor  j  loads  the  idle  processors 
above  it  with 

€  {v, •  •  •  ~  1  +  i}}i 

with  ( Qi/di)j-i  in  row  i  —  u.  Processor  v  sends  the  values  {(za)„}  through 
row  1,  modifying  the  values  {(&«);}  from  the  other  processors.  This  finishes  the 
calculation  for 

>  9  —  1,  •••«*■}. 

each  element  of  which  is  sent  down  row  2  as  soon  as  it  is  calculated  to  continue  the 
reduction.  Similarly,  {(ia)|/+l}is  sent  through  row  t  —  1,  finishing  the  calculation 
of  {(sa)v+i+i}.  As  in  the  factorisation  the  calculation  is  also  performed  in  a 
triangular  processor  array  below  the  horizontal  spoke.  This  process  takes  p  —  1 
communication  steps*  to  load  the  processor  array,  and  2p  +  r  forward  solve  steps 
consisting  of  2  writes,  2  reads,  1  multiplication,  and  1  subtraction.  See  Figure  11 
for  a  snapshot  of  the  algorithm  with  u  —  1. 

In  the  factorisation  we  update  the  column  entries  of  a  box  of  vertices  one 
side  at  a  time.  The  forward  solve  updates  the  elements  of  the  right  hand  sides 
corresponding  to  these  same  vertices  using  the  same  rectangular  processor  array. 
For  example,  consider  the  update  of  the  left  side  using  the  results  of  a  spoke 
forward  solve, 

{(*-)>  |  »  =  1,- -  -  ,r ;  j  =  1+p}, 

stored  at  the  top.  Each  processor  j  in  the  left  side  fills  its  corresponding  row 
with  the  elements 

{(ti i/di)j-i  |  »  €  {v, . . .  ,v  -  1  +  p}>  , 

placing  in  the  processor  directly  below  where  is  stored.  The 

data  movement  is  the  same  as  in  the  factorisation,  with  the  entries  to  be  mod¬ 
ified,  {(&/.)>},  moving  from  the  left  to  the  right,  and  the  information  doing  the 
modifying,  {(£*)<},  moving  from  the  top  to  the  bottom  This  takes  p  communi¬ 
cation  steps  to  load  {(u,/d,)},  p  +  m  +  r-  2  forward  solve  stc|>s  (m  being  the 
number  of  vertices  from  the  left  side  being  updated),  and  p  t-r  -  I  communication 
steps  to  return  the  results.  See  Figure  12  for  a  snapshot  of  the  algorithm  with 
v  as  1,  p  =  3,  and  m  =  4. 

*  This  assumes  that  writing  the  same  data  in  two  different  directions  costs 
the  same  as  writing  it  only  once  during  a  communication  step.  Otherwise  2p  —  2 
communication  steps  are  required. 
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factorisation,  except  that  some  efficiencies  can  be  exploited  in  levels  1  and  /.  In 
both  of  these  the  given  algorithm  will  work,  but  in  level  l  there  is  no  surrounding 
box  of  active  processors  for  the  single  subproblem,  nor  other  subproblems  with 
which  it  must  synchronise,  so  it  can  simply  ignore  the  parts  of  the  algorithm 
which  are  meaningless  for  it.  This  takes  15  *  21-1  4-  3  factorisation  steps  and 
8  •  21-1  —  1  communication  steps  for  step  (a),  and  6  •  2Z_1  +  3  factorisation  steps 
and  10  *  2t~1  + 1  communication  steps  for  step  (b).  The  level  1  subproblems  can 
also  be  done  more  efficiently,  requiring  a  complexity  of  41  factorisation  steps.  See 
[11]  for  more  details.  This  leads  us  to: 

Theorem  5.1.  The  parallel  factorisation  of  an  N  x  N  symmetric  positive  def¬ 
inite  matrix  A  whose  matrix  graph  is  isomorphic  to  a  y/N  x  y/N  finite  element 
mesh  takes  63 n  —  24  log,  n  —  115  factorisation  steps  and  46n  —  19  log,  n  —  128 
communication  steps,  if  ns  y/N  =  21  —  1. 

0.  The  Parallel  Triangular  Matrix  Solves.  Having  factored  A  into 
LDL1  we  still  must  solve  three  problems, 

Li  =  5  forward  solve 

Dy  =  i  diagonal  solve 

£*£  =  y  backward  solve, 

to  solve  the  system  Ax  =  6.  After  the  factorisation  row  j  of  L,  D,  and  L*  are  in 
processor,/.  We  assume  that  5/  has  also  been  loaded  into  processor  j. 

The  forward  solve  can  be  calculated  in  conjunction  with  the  factorisation  by 
adding  (b*  1)  as  an  extra  row,  and  its  transpose  as  an  extra  column.  In  general, 
if  we  had  r  right  hand  sides,  B  =  [5*  . . .  6rj ,  the  algorithm  produces 

(A  Bt\_(  L  OW D  0 \(L*  D~iL~lB*\ 

VB  l)~\BL-tD~i  Ij\0  i)\0  I  )' 

where  I  ~  I  —  BA~lBt  is  never  calculated.  The  diagonal  solve  is  provided  here 
also.  For  r  right  hand  sides  the  cost  is 

r(10  log,  (n  +  1)  —  2)  factorisation  steps 

and 

r(9  log,  (n  +  1)  —  5)  communication  steps. 

The  vectors  {b„}  are  not  always  be  available  at  the  time  of  factorisation. 
For  completeness  sake  we  describe  a  forward  solve  algorithm,  which  is  based  on 
mimicking  the  factorisation  algorithm.  The  calculation  is 

i-1  *  . 
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vi)  Update  the  right. 


Return  (and  load  data  for  (vii)). 


vli) 


Transpose  the 
update  information. 


Figure  10:  Phases  of  step  (b)  of  the  factorisation.  Information  flow  is 
indicated  by  the  arrows.  Data  sources  are  represented  by  the  heavy  solid 
lines.  Stippled  areas  indicate  processors  engaged  in  arithmetic. 
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require  simultaneous  access  to  shared  processors.  This  organization  also  allows 
step  (a)  of  all  the  snbproblcms  in  level  k  to  be  calculated  concurrently.  Not  all 
subproblems  are  of  the  form  described.  Those  crosses  on  the  edge  of  the  grid  are 
missing  one  or  more  sides  of  the  surrounding  box.  But  the  crucial  processors, 
the  idle  ones  in  the  four  quadrants  of  the  cross,  are  always  there.  Processors 
solving  border  subproblems  wait  at  times  to  synchronize  with  the  calculations  of 
the  interior  subproblems. 

Step  (b),  the  elimination  of  the  vertical  separator,  is  very  similar  to  step  (a). 
Refer  to  Figure  10  for  a  pictorial  description  and  below  for  a  description  of  the 
differences  from  step  (a). 

i)  Send  the  column  data  in  the  vertical  separator  processors  to  the  left. 
Store  the  data  in  the  idle  processors  just  inside  the  left  side  of  the 
surrounding  square.  This  is  done  so  that  the  reduction  algorithm  can 
use  all  of  the  idle  processors  inside  the  square. 

ii)  Reduce  the  vertical  separator.  The  results  are  stored  in  two  places,  in 
the  idle  processors  next  to  the  right  side  of  the  surrounding  square  and 
in  the  idle  processors  next  to  the  bottom  side  of  the  square. 

iii)  Update  the  vertices  on  the  bottom.  The  update  information,  moving 
from  right  to  left,  is  saved  in  the  idle  processors  next  to  the  left  side  of 
the  surrounding  box  of  active  processors. 

hr)  Update  the  vertices  on  the  left.  The  update  information,  moving  from 
the  bottom  to  the  top,-  is  saved  in  the  idle  processors  next  to  the  top  of 
the  box. 

v)  Update  the  vertices  on  the  top. 

vi)  Update  the  vertices  on  the  right. 

vii)  Transpose  the  update  information.  We  still  need  to  store  the  values 

|/  €  the  vertical  separator ,  k  >  j} 

d> 

for  each  processor  k  in  the  vertical  separator.  First  send  the  information 
up  into  the  interior  processors  from  the  bottom.  Next  the  processors 
to  the  right  of  the  separator  send  these  values  left  into  the  vertical 
separator,  followed  by  the  processors  on  the  left  sending  their  values 
right  into  the  vertical  separator. 

Lemma  5.2.  Step  (b)  at  level  k  takes  62  •  2k~1  —  11  factorisation  steps  and 
51  •  2fc_1  —  11  communication  steps. 

This  organization  allows  all  of  the  level  k  subproblcms  to  perform  step  (b) 
in  parallel.  The  border  subproblems  again  represent  a  straightforward  reduction 
of  the  given  algorithm,  with  processors  waiting  when  appropriate.  This  gives 
all  the  information  required  to  calculate  the  parallel  complexity  of  the  parallel 
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Update  the  vertices  on  the  right.  Include  the  top  right  vertex  of  the 
surrounding  rectangle,  but  not  the  bottom  right  one.  The  rectangular 
processor  array  includes  the  active  processors  oh  the  top  and  the  right. 
The  update  information  flows  from  the  bottom  up,  the  column  entries  to 
be  updated  move  from  the  right  to  the  left,  and  the  results  end  up  in  the 
far  left  column  of  idle  processors,  plus  one  active  processor  at  the  top. 
The  results  (and  { (t?^- /cf, ) *_> })  axe  then  sent  back  to  the  appropriate 
processors. 

Update  the  vertices  on  the  bottom.  Include  the  bottom  right  vertex, 
but  not  the  bottom  left.  The  rectangular  processor  array  consists  of  the 
active  processors  on  the  bottom  and  the  bottom  half  of  the  active  pro¬ 
cessors  on  the  right,  and  the  idle  processors  below  the  horisontal  spoke. 
The  update  information  flows  from  the  left  to  the  right,  the  column  en¬ 
tries  to  be  updated  flow  from  the  bottom  up,  and  the  results  end  up  in 
the  processors  just  below  the  horisontal  spoke  processors.  The  results 
(and  {(vj/dj)k-j})  are  then  sent  back  to  the  appropriate  processors. 
Simultaneously  the  update  information  held  in  the  processors  on  the 
left  side  above  the  horisontal  spoke  is  sent  to  the  right  side.  The  values 

{ — (t?y)fc_y  |j  €  the  horisontal  spoke ,  k  >  j} 

«/ 

are  also  sent  to  each  processor  k  in  the  horisontal  spoke  while  the  update 
proceeds  in  the  processors  below  the  spoke, 
v)  Update  the  vertices  on  the  top.  Include  the  top  left  vertex,  but  not 
the  top  right  one.  The  rectangular  processor  array  includes  the  active 
processors  on  the  top  and  the  top  half  of  the  active  processors  on  the 
left.  The  update  information  flows  from  the  right  to  the  left,  the  column 
entries  to  be  updated  flow  from  the  top  to  the  bottom,  and  the  results 
end  up  in  the  processors  just  above  the  hoiisontal  spoke  processors. 
The  results  (and  are  then  sent  back  to  the  appropriate 

processors. 

Define  a  factorisation  step  to  be  two  simultaneous  writes,  followed  by  two 
simultaneous  reads,  followed  by  either  one  division  or  two  multiplications  and 
one  addition.  Define  a  communication  step  to  be  one  write  followed  by  one  read. 

Lemma  5.1.  Step  (a)  of  the  factorization  at  level  k  takes  43  •  2k~l  -  13  factor- 
izatioa  steps  and  28  •  2k_1  -  8  communication  steps. 

Proof.  A  simple  counting  argument  and  the  exact  specification  of  the  reduc¬ 
tion  and  update  algorithms  are  all  that  is  needed.  See  [11]  for  details.  I 

This  particular  organisation  of  step  (a)  . allows  both  the  right  and  left  hori¬ 
sontal  spokes  to  be  eliminated  simultaneously  because  the  two  eliminations  never 


iii) 


hr) 


16 


Nested  Dissection  on  a  Multiprocessor 


iv) 
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Figure  9:  Phases  of  step  (a)  of  the  factorisation.  Information  flow  is  indi¬ 
cated  by  the  arrows.  Data  sources  are  represented  by  the  heavy  solid  lines. 
Stippled  areas  indicate  processors  engaged  in  arithmetic. 
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complete,  with  results  residing  in  the  processors  in  column  1.  Figure  8  shows  a 
snapshot  of  the  architecture  near  the  beginning  of  the  algorithm  for  a  particular 
ordering  (and  with  p  =  3  and  m  =  4).  Again,  this  algorithm  will  be  described  in 
more  detail  in  [ll]. 

5.  The  Parallel  Factorization.  We  use  a  parallel  processor  whose  inter* 
connection  topology  approximates  the  graph  of  the  matrix  A.  Each  vertex  i  is 
represented  by  a  processor,  which  initially  contains  the  nonsero  matrix  elements 
(on  and  below  the  diagonal)  of  column  t.  After  the  ttb  vertex  is  eliminated  the 
processor  will  contain  the  nonseroes  in  the  *tb  rows  of  L,Dt  and  Once  the 
vertex  is  eliminated  the  processor  is  free  to  be.  used  in  the  elimination  of  other 
vertices,  and  we  call  it  idle.  Until  then  the  processor  is  called  active.  We  use  these 
idle  processors  to  implement  the  systolic  algorithms  introduced  in  the  previous 
section.  Henceforth  we  will  use  the  terms  vertex  and  processor  interchangeably, 
using  whichever  is  most  appropriate.  We  will  also  refer  to  the  reduction  or  up* 
date  phases  in  a  spoke  elimination  as  reducing  or  updating  the  vertices  associated 
with  the  values  being  calculated. 

Step  (a)  of  the  cross  elimination  for  a  subproblem  at.  level  k  is  pictorially 
represented  in  Figure  9,  and  described  below. 

i)  Reduce  the  horisontal  spoke.  The  reduction  is  done  simultaneously 
in  idle  processors  above  and  below  the  spoke  processors.  The  results 
{(dj,iij/dj)}  are  saved  in  four  different  groups  of  processors,  in  the  left 
edges  of  the  two  triangular  processor  arrays  and  in  the  idle  processors 
next  to  the  top  and  the  bottom  of  the  box  defined  by  the  surrounding 
active  processors. 

ii)  Update  the  vertices  on  the  left.  Include  the  bottom  left  vertex  of  the 
surrounding  rectangle,  but  not  the  top  left  one.  The  rectangular  pro¬ 
cessor  array  to  calculate  the  update  consists  of  the  active  processors  on 
the  left  and  the  bottom,  and  all  of  the  interior  idle  processors.  The 
update  information  {{dj,  Vj/dj)}  flows  down  from  the  top,  the  column 
entries  to  be  updated  move  from  left  to  right,  and  the  results  end  up  in 
the  far  right  column  of  idle  processors  (including  one  active  processor  at 
the  bottom).  The  update  information  is  saved  in  the  horisontal  spoke 
processors  as  it  moves  back  through  them.  Next  we  need  to  return  the 
results  of  the  update  to  their  original  processors.  We  also  need  to  save 
the  values 


{(— vy)*_/|j  €  the  horisontal  spoke,  k  >  j) 


at  processor  k.  These  elements  represent  part  of  the  kth  row  of  L. 
Fortunately  the  update  algorithm  left  these  elements  in  the  processors 
between  processor  k  and  the  processor  holding  fc’s  updated  column  ele¬ 
ments.  When  the  results  are. sent  back,  the  set  of  values  {(t»y/dy)*_y) 
can  be  sent  to  processor  k  as  well. 
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elements  needed  to  calculate  fjV.  Tins  continues  with  processor  (m,m  —  1)  cal¬ 
culating  dm  and  vm,  while  row  m  calculates  for  j  >  m.  In  3p  +  q  steps  the 
reduction  phase  is  completed.  More  detail  on  this  algorithm  will  be  available  in 
a  Stanford  technical  report  [11].  See  Figure  6  for  a  snapshot  of  the  algorithm. 
This  systolic  algorithm  incorporates  no  new  ideas  [8]  [9],  but  it  is  what  is  needed 
to  calculate  the  reduction  phase  efficiently  on  our  machine. 

The  update  phase  requires  access  to  the  vectors  {(d,,  tJ,/d,)|t  =  l,...,p} 
which  are  calculated  in  the  reduction  phase.  Call  this  the  update  information. 
Since  the  factorization  preserves  symmetry,  only  the  elements  on  and  below  the 
diagonal  need  to  be  updated.  So  for  column  j  of  the  matrix  only  the  elements  d, 
and  (v i/di)k-i ,  k  >  j  are  used. 

An  architecture  to  update  a  subset  of  size  m  of  the  q  columns  is  an  mx  (p+1) 
rectangular  array  of  processors.  See  Figure  7.  Assume  each  processor  in  row  1, 
except  processor  (p+1, 1),  has  access  to  (d, ,  v^/d,)  for  some  distinct  *  €  {1, . . . ,  p}. 
Assume  each  processor  in  column  p  +  1  has  access  to  the  elements  on  and  below 
the  diagonal  of  a  column  j  of  the  matrix  for  some  j  €  {p  +  1, . . .  ,p  +  ?}. 

The  algorithm  begins  with  processor  (p+1,1)  sending  its  column  entries  to 
the  right  through  the  processors  in  row  1.  Since  the  matrix  column  assigned  to 
processor  (p  +  1, 1)  is  arbitrary,  the  row  1  processors  assume  that  it  is  column 
p+1  and  expect  to  calculate 


Ap)  _  r.  _  V' 


as  fi.v+ i>  3  €  {p  +  1, ...,p  +  g},  moves  through  row  1.  If  instead  processor 
(p  +  1,1)  contains  matrix  column  s,  then  for  the  first  s  —  p  —  1  steps  it  sends  a 
NIL  value  to  tell  the  row  1  processors  that  nothing  should  be  done  during  these 
steps.  As  soon  as  a  real  number  is  sent  the  processors  can  recognize  this  and 
start  calculating 

y(p)  _  J  _ 

di 

for  successive  j  €  {«,...,  p  +  g}.  Processor  (r,  1) ,  r  €  { 1, . . .  ,p)  ,  must  wait  p  —  r 
steps  for  its  first  value  from  processor  (p  +  1, 1)  and  remains  delayed  that  much 
throughout  llic  algorithm.  As  soon  as  the  jtU  value  from  processor  (p  +  1, 1) 
readies  processor  (r,  I),  (vr/dr)J+,,_r  is  no  longer  needed  (with  the  exceptions 
that  dr  and  (vr/dr)8_r  arc  saved  locally)  and  can  be  sent  to  processor  (r,  2). 
Thus  if  processor  (p  +  1,2)  starts  sending  its  NILs  and  column  entries  through 
row  2  one  step  after  processor  (p  +  1,1)  starts,  row  2  can  update  the  matrix 
column  associated  with  processor  (p+  1, 2).  The  same  analysis  holds  true  for  the 
other  rows  of  the  processor  array,  where  each  row  begins  one  step  later  than  the 
previous  row.  In  p  +  m  +  q  steps  the  update  of  these  m  columns  of  the  matrix  is 
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So  we  have 

Theorem  6.2.  The  backward  solve  for  the  model  problem  takes 
10  •  n  -  12  log3(n  +  1)  —  8  backward  solve  steps. 

The  backward  solve  algorithm  on  a  spoke  is  similar  to  the  Kung  and  Leiserson 
linear  systolic  array  algorithm  [9].  This  is  then  embedded  into  our  broadcast 
algorithm. 

The  backward  solve  algorithm  wastes  some  time  broadcasting  results  to  pro¬ 
cessors  where  they  are  not  needed.  The  simplicity  of  the  program  for  the  algo¬ 
rithm  outweighs  the  small  amount  of  savings  which  are  available.  A  more  serious 
drawback  is  the  lack  of  savings  for  multiple  right  hand  sides.  The  complete  algo¬ 
rithm  must  be  rerun  for  each  y«.  This  follows  from  the  backward  solve  algorithm 
being  based  on  a  linear  systolic  array  algorithm. 

Gannon  describes  a  different  backward  solve  algorithm  which  is  suitable  for 
multiple  right  hand  sides.  His  approach  uses  a  two  dimensional  systolic  array 
algorithm  and  only  broadcasts  the  results  when  and  where  they  are  needed.  The 
backward  solve  of  a  cross  is  again  a  subproblem.  Preliminary  to  this  the  right 
hand  sides  corresponding  to  these  vertices  are  updated  by  the  results  available 
in  the  surrounding  box  of  vertices.  This  update  is  divided  into  eight  parts,  one 
for  each  side  of  the  box  updating  either  the  horisontal  spokes -or  the  vertical 
separator.  The  division  avoids  conflict  with  neighboring  subproblems  for  access 
to  the  processors  of  the  box,  as  in  the  factorisation.  The  algorithm  then  follows 
with  the  backward  solve  of  the  cross. 

We  improve  on  Gannon’s  algorithm  by  again  mimicking  the  factorisation  al¬ 
gorithm.  Start  by  updating  the  vertical  separator  vertices  with  respect  to  the  four 
sides  of  the  enclosing  box.  Each  part  consists  of  loading  the  interior  processors 
with  elements  from  Ll,  then  calculating  the  update  in  a  fashion  similar  to  that  in 
the  forward  solve  algorithm.  The  results  arc  not  returned  to  the  vertical  separa¬ 
tor  processors  between  updates  by  the  four  sides.  The  vertical  separator  is  then 
backsolvcd  using  a  triangular  processor  array  ns  in  the  forward  solve  algorithm. 
When  these  results  have  been  returned  to  the  vertical  separator  processors  the 
two  horisontal  spokes  are  updated  in  parallel  by  the  four  sides  of  the  boxes  they 
are  enclosed  in,  one  side  at  a  time.  The  data  movement  is  again  analogous  to  that 
of  the  factorisation.  This  step  ends  with  each  horisontal  spoke  being  backsolvcd 
using  a  triangular  processor  array.  Thus  this  backward  solve  algorithm  imitates 
the  factorisation  algorithm  with  the  ordering  of  its  basic  steps  reversed,  just  as 
the  forward  solve  algorithm  uses  the  same  ordering  ns  in  the  factorisation. 

Analysing  this  variant  of  Gannon’s  backward  solve  gives: 

Theorem  6.3.  A  variation  of  Gannon’s  backward  solve  algorithm  on  this  archi¬ 
tecture  takes  21n  +  (lOr  —  32)  log3  (n  +  1)  -  13r  —  8  forward  solve  steps 
and  17n  -I-  (llr  —  31)  log3  (n  +  1)  —  20r  —  3  communication  steps. 

This  backward  solve  algorithm  requires  a  more  complex  program,  and  this 
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would  effect  its  running  time,  but  for  multiple  right  hand  sides  it  soon  becomes 
cost  effective. 

7.  Practicalities.  In  the  previous  two  sections  we  gave  detailed  operation 
counts  for  the  algorithm.  The  constants  in  the  operation  counts  can  be  unproved 
to  some  degree  but  at  the  cost  of  increased  program  overhead,  program  size,  and 
programming  difficulty.  We  are  still  working  out  the  program  specifics,  but  it  is 
clear  that  the  program  sise  will  be  independent  of  n,  and  that  the  same  code  can 
reside  in  all  processors. 

We  assumed  that  every  processor  along  one  side  of  the  processor  array  can 
communicate  with  something  external  to  the  processor  array.  For  these  commu¬ 
nication  lines  to  be  used  in  parallel  requires  that  the  processor  array  be  communi¬ 
cating  with  another  parallel  processor  capable  of  n  simultaneous  reads  or  writes, 
or  with  a  serial  processor  with  a  communication  channel  which  is  enough  faster 
than  the  processor  array’s  cycle  time  that  it  appears  to  be  a  parallel  processor 
(for  some  fixed  n).  For  the  program  load  it  doesn’t  matter  whether  it  is  loaded  in 
parallel  through  n  processors,  or  whether  it  is  loaded  into  one  processor  to  start 
with.  If  the  program  consists  of  m  words  it  will  take  m  +  n  —  1  communication 
steps  to  load  the  processor  array  given  parallel  external  communication,  and  it 
will  take  at  most  2m  +  2n  —  3  communication  steps  if  only  one  processor  can 
communicate  externally. 

The  data  load  sends  9  floating  point  numbers  to  each  processor,  representing 
the  nonseroes  in  each  column  of  A.  Another  floating  point  number  per  proces¬ 
sor  is  loaded  representing  6.  If  parallel  external  communication  is  available  the 
loading  the  matrix  requires  n  +  8  communication  steps,  and  loading  the  right 
hand  side  requires  n  communication  steps.  If  only  one  processor  can  communi¬ 
cate  externally  the  data  load  degrades  to  9n3  communication  steps  for  loading 
the  matrix  and  n3  for  loading  b.  This  dependence  on  n3  will  not  dominate  the 
cost  of  solving  Ax  =  b  until  n  is  larger  than  2G0  +  16/p,  where  p  is  the  ratio  of 
the  cost  of  intcrprocessor  communication  to  floating  point  arithmetic.  Another 
related  problem  is  that  the  calculation  of  the  entries  of  A  and  b  can  dominate  the 
cost  of  solving  Ax  =  b  for  small  to  moderate  n  even  in  serial  machines.  While  the 
speed  up  of  any  part  of  this  process  is  not  to  be  ignored,  for  significant  savings 
the  generation  of  the  matrix  and  the  right  hand  side  may  need  to  be  moved  onto 
the  processor  array. 

The  retrieval  of  the  solution  2  costs  n  communication  steps  given  parallel 
external  communication,  and  n3  communication  steps  if  the  data  is  Tunneled 
through  one  processor.  This  bottleneck  is  not  avoidable.  If  the  elements  of  the 
matrix  factorisation  are  desired  even  the  assumed  parallel  external  communi¬ 
cation  is  not  sufficient.  L  contains  0(n2  log2  (n  +  1))  nonzeroes  [3],  with  0(n) 
nonzeroes  in  each  processor  on  the  level  l  cross.  Removing  this  information  re¬ 
quires  0(n  log3  (n+1))  communication  steps. 

8.  Conclusions  and  Comparisons.  From  sections  5  and  6  we  can  cal- 
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culatc  the  total  amount  of  work  required  to  solve  Ax  =  5.  Assume  that  inter¬ 
processor  communication  within  the  factorisation  and  triangular  solve  steps  is 
no  cheaper  than  the  cost  within  a  pure  communication  step.  Assume  that  two 
multiplications  and  one  addition  cost  more  than  one  division.  We  define  syn¬ 
chronisation  logic  and  other  housekeeping  duties  to  be  step  overhead.  This  gives 
us 

Theorem  8.1.  Asymptotic  costs  for  solving  AS  —  b  are 


[126  +  172p  +  109<r]n  +  0(log2  (n  +  1)) 
[22  +  61.5p  +  39.5o]n  +  0(log2  (fi  +  1)) 
2 

[10  +  30p  +  10o]n  +  0(log2  (n  +  1)) 


for  the  factorisation 
for  the  forward  solve 
for  the  diagonal  solve 
for  the  backward  solve 


Each  step  represents  a  Boating  point  multiplication  and  a  Boating  point  addition 
or  subtraction,  o  is  the  relative  cost  of  the  step  overhead  compared  to  the  com¬ 
bined  cost  of  the  multiplication  and  the  addition,  and  p  is  the  relative  cost  of  a 
write  and  a  read  compared  to  the  cost  of  a  multiplication  and  an  addition. 


Proof:  Simple  arithmetic  using  Theorems  5.1,  6.1,  and  6.2  suffices  to  give 
the  result.  This  result  is  an  overestimate  due  to  the  conservative  combination  of 
multiplications  with  additions  and  writes  with  reads.  I 

A  similar  analysis  for  multiple  right  hand  sides  is  straight  forward,  where  now 
the  alternative  backward  solve  algorithm  is  used. 

Corollary  8.1.  The  asymptotic  efficiency  E  of  this  algorithm  for  solving  Ax  =  6 
on  this  architecture  can  be  bounded. 


4  loga  (n  +  1) 
’ll  n 


Eb.  € 


/  4  log2  (n  +  1)  4  log2  (n  +  1) 
\15  n  ’5  n 


for  the  factorisation 

for  the  forward  solve 
for  the  diagonal  solve 
for  the  backward  solve 


For  the  solution  using  the  separate  forward  solve  algorithm, 


For  the  solution  using  the  combined  factorisation/forward  solve  algorithm, 
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Proof:  We  define  the  efficiency  of  a  parallel  algorithm  to  be 

Complexity  of  aerial  algorithm 
g  _  Complexity  of  parallel  algorithm 
Number  of  processors 

This  is  an  approximation  to  the  usual  definition  of  efficiency  [7]  which  compares 
the  time  spent  in  executing  the  serial  and  parallel  versions  of  the  algorithm  rather 
than  their  complexities.  For  example,  we  ignore  the  memory  access  times  in  both 
the  serial  and  parallel  algorithms.  It  is  reasonable  to  assume  that  p,<r  €  (0, 1) , 
which  provides  the  upper  and  lower  bounds  for  the  efficiency  estimates.  The  exact 
values  will  depend  on  the  architecture.  The  serial  complexity  of  the  factorisation 
is 

2A7 

^nJ  +  0(nJlo*a(n  +  l)) 
and  the  serial  complexity  of  the  triangular  solves  is 

QO 

— naloga  (n  +  l)+  0(n2) 

[3].  The  results  then  follow  from  Theorem  8.1.  I 

We  achieve  linear  speedup*  for  the  solution  of  Ax  =  b,  although  the  triangu¬ 
lar  matrix  solves  are  not  as  parallelisable  with  this  algorithm  and  architecture. 
We  expect  a  and  p  to  be  fairly  small  and  expect  the  upper  bounds  on  the  efficien¬ 
cies  in  Corollary  8.1  to  be  good  approximations.  While  this  algorithm  provides 
linear  speedup,  on  the  average  less  than  one  twelfth  of  the  processors  are  working 
at  any  one  time. 

This  loss  of  efficiency  is  due  to  two  factors.  We  partition  a  kth  level  subprob¬ 
lem  into  small  enough  parts  to  be  solvable  using  only  the  local  idle  processors, 
and  we  synchronize  the  execution  of  these  parts  with  neighboring  subproblcms  to 
avoid  conflicts  of  access  to  the  shared  processors.  The  division  of  the  subproblcm 
into  parts  influences  the  efficiency  by  increasing  the  overhead  incurred  in  filling 
and  emptying  the  pipelines  of  the  systolic  algorithms.  It  also  incurs  a  cost  in 
pure  communication  steps  to  arrange  the  data  before  and  after  each  part.  The 
synchronization  cost  stems  from  similar  parts  in  different  subproblems  requiring 
different  amounts  of  work.  The  most  expensive  case  is  assumed  to  insure  no 
conflict  between  subproblcms.  There  is  also  a  percentage  of  loss  due  to  the  in- 
tcrproccssor  communication  associated  with  each  step  involving  arithmetic.  This 
is  assumed  to  be  more  expensive  than  the  corresponding  memory  access  which 
occurs  in  the  serial  version. 

*  The  speedup  is  the  efficiency  times  the  number  of  processors  being  used. 
The  speedup  is  linear  (in  the  number  of  processors)  if  the  efficiency  is  independent 
of  the  number  of  processors. 
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The  parallel  algorithm  is  not  obviously  faster  than  the  serial  algorithm  for 
small  n.  The  parallel  algorithm  spends  time  in  unproductive  work  in  the  pure 
communication  steps  and  the  synchronization  between  subproblems.  The  asymp¬ 
totic  analysis  implies  that  the  parallel  algorithm  is  faster  for  n  >  7  if  both  p  and 
a  are  less  than  .7.  But  for  small  n  the  waste  is  less  and  the  lower  order  terms 
become  more  important.  For  small  p  and  a  the  parallel  algorithm  is  faster  for  n 
as  small  as  3. 

A  competing  parallel  algorithm  is  the  Kung-Leiserson  systolic  factorisation 
algorithm  for  banded  dense  matrices  [9].  This  systolic  architecture  requires  n3 
processors  and  the  algorithm  finishes  in  3na+n  steps  since  the  minimal  bandwidth 
for  the  model  problem  is  n  [3].  Each  step  consists  of  three  writes,  three  reads,  two 
multiplications,  and  one  addition.  This  is  close  to  the  basic  step  in  our  algorithm. 
Rrom  the  asymptotic  analysis  we  see  that  the  parallel  nested  dissection  algorithm 
will  be  faster  for  n  >  31.  The  processors  for  the  Kung-Leiserson  systolic  array  do 
not  need  much  local  memory,  but  the  systolic  array  must  be  fed  data  in  parallel 
from  an  external  source. 

In  conclusion,  this  algorithm  efficiently  parallelises  the  nested  dissection  algo¬ 
rithm  using  a  simple  MIMD  architecture.  The  architecture  allows  linear  speedup 
of  what  is  a  near  optimal  serial  algorithm.  Likewise,  the  parallel  algorithm  solves 
a  problem  which  is  globally  dependent  on  its  data  in  the  same  order  of  time  it 
takes  to  communicate  a  single  datum  from  one  side  of  the  processor  grid  to  the 
other.  Thus  this  architecture  and  algorithm  are  well  matched  for  solving  the 
model  problem.  As  larger  multiprocessors  are  built  this  algorithm,  and  its  vari¬ 
ations  for  more  general  problems,  should  be  useful  when  a  direct  factorisation  is 
desired  to  solve  problems  similar  to  the  model  problem. 
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